Introduction to Triton Programming: Beyond Element-wise Operations: The Move to Blocked Matrix Operations

In previous lessons, we focused on element-wise operations (such as a basic ReLU on a matrix). These are memory-bound: the GPU spends more time moving data from HBM to registers than it spends doing arithmetic.

1. Why GEMM is central

General matrix multiplication (GEMM) performs $O(N^3)$ arithmetic operations on only $O(N^2)$ data, so each value loaded from memory can be reused about $N$ times if it is kept on-chip. This lets us hide memory latency behind a large volume of arithmetic work, which makes GEMM the "heartbeat" of large language models (LLMs).
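The gap between the two workloads can be made concrete with a rough arithmetic-intensity estimate (FLOPs per byte moved). The sketch below is only a back-of-the-envelope model: the helper names are hypothetical, it assumes 4-byte (fp32) elements, counts `max(x, 0)` as one operation, and assumes an idealized GEMM in which each matrix crosses HBM exactly once.

```python
def relu_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for an element-wise ReLU on an n x n matrix."""
    flops = n * n                             # one max(x, 0) per element
    bytes_moved = 2 * n * n * bytes_per_elem  # read each input, write each output
    return flops / bytes_moved


def gemm_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """FLOPs per byte for an n x n GEMM, assuming each matrix crosses HBM once."""
    flops = 2 * n ** 3                        # one multiply and one add per term
    bytes_moved = 3 * n * n * bytes_per_elem  # read A and B, write C
    return flops / bytes_moved


for n in (256, 1024, 4096):
    print(n, relu_intensity(n), round(gemm_intensity(n), 1))
```

Under these assumptions the ReLU intensity is constant no matter how large the matrix gets, while the GEMM intensity grows linearly with `n`. Real kernels fall short of the idealized GEMM figure because tiles must be reloaded, but the trend is what matters.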

2. Representing memory in 2D

Physical RAM is one-dimensional. To represent a 2D tensor, we use strides: the number of elements to skip in memory to advance by one row or by one column. A common mistake in production is to assume that a tensor is contiguous. If you mix up the row and column strides in your pointer arithmetic, you will read "phantom" data or trigger memory access violations.
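A minimal pure-Python sketch of this address calculation follows, using a flat list as a stand-in for device memory. The `load` helper and the tiny 2 x 3 example are illustrative assumptions, not part of the lesson; in PyTorch the real stride values would come from `tensor.stride()`.

```python
def load(buf, i, j, stride_row, stride_col, base=0):
    """Read element (i, j) of a 2D view laid over a flat 1D buffer."""
    return buf[base + i * stride_row + j * stride_col]


# A 2 x 3 row-major matrix [[0, 1, 2], [3, 4, 5]] stored flat.
buf = [0, 1, 2, 3, 4, 5]
rows, cols = 2, 3

# Its transpose is a 3 x 2 view over the SAME buffer with the strides swapped:
# moving down one row advances 1 element, moving right one column advances 3.
t_stride_row, t_stride_col = 1, cols

correct = [[load(buf, i, j, t_stride_row, t_stride_col) for j in range(rows)]
           for i in range(cols)]

# The bug: treating the 3 x 2 view as contiguous (row stride 2, column stride 1).
wrong = [[load(buf, i, j, rows, 1) for j in range(rows)]
         for i in range(cols)]

print(correct)
print(wrong)
```

Note that the buggy version stays inside the buffer and raises no error; it simply returns the wrong values. On a larger or offset view the same mistake can also step outside the allocation.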

3. Generalizing through blocks

Triton generalizes the element-wise logic by moving from single pointers to blocks of pointers. By using two-dimensional tiles (for example $16 \times 16$), we exploit data reuse in fast on-chip SRAM, keeping the data "hot" for fused operations such as bias addition or activations before writing back to global memory.
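The block of pointers described above comes from combining a column vector of row indices with a row vector of column indices. In a Triton kernel this is usually written with two `tl.arange` vectors indexed as `[:, None]` and `[None, :]`; the sketch below mimics the same broadcast in plain Python so it runs without a GPU. The function names, the tile sizes, and the bounds mask are illustrative assumptions.

```python
def tile_offsets(row_start, col_start, block_m, block_n, stride_row, stride_col):
    """Element offsets for a block_m x block_n tile of a strided 2D tensor."""
    rows = [row_start + i for i in range(block_m)]  # plays the column vector
    cols = [col_start + j for j in range(block_n)]  # plays the row vector
    # "Broadcast" the two 1D vectors against each other into a 2D grid.
    return [[r * stride_row + c * stride_col for c in cols] for r in rows]


def tile_mask(row_start, col_start, block_m, block_n, m, n):
    """True where the tile element lies inside the m x n tensor."""
    return [[(row_start + i) < m and (col_start + j) < n
             for j in range(block_n)]
            for i in range(block_m)]


# A 3 x 4 row-major tensor (strides 4 and 1), covered by 2 x 2 tiles.
print(tile_offsets(0, 0, 2, 2, 4, 1))  # top-left tile
print(tile_offsets(2, 2, 2, 2, 4, 1))  # bottom-right tile, partly out of bounds
print(tile_mask(2, 2, 2, 2, 3, 4))     # mask that guards the overhanging row
```

Adding these offsets to a base pointer yields the 2D block of addresses. The mask matters whenever the tensor dimensions are not multiples of the tile size, because the last tile in each direction overhangs the edge.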

QUESTION 1

Why is an elementwise ReLU on a large matrix considered 'memory-bound'?

The ReLU function requires complex transcendental math.

The ratio of arithmetic operations to memory loads is very low (1:1).

Matrices are naturally stored in CPU memory only.

Triton cannot process non-linear activations.

QUESTION 2

What is the result of 'The Stride Trap' in production kernels?

The kernel runs significantly faster but with less precision.

Memory access violations or corrupted output due to incorrect address calculation on non-contiguous tensors.

The GPU automatically corrects the indexing using L2 cache.

The tensor is forced into a 1D shape by the compiler.

QUESTION 3

How does Triton represent a 2D tile of pointers?

By using a nested Python list of integers.

By broadcasting a 1D column vector and a 1D row vector of offsets together.

By launching multiple 1D kernels sequentially.

By allocating a special 2D register file.

QUESTION 4

Which operation benefits most from the O(N³) complexity shift to hide memory latency?

Vector Addition

Matrix Multiplication (GEMM)

Sigmoid Activation

Global Average Pooling

QUESTION 5

List three op sequences in your current workflow that launch multiple PyTorch kernels and might benefit from fusion.

Linear -> Bias -> ReLU; LayerNorm -> Dropout; Softmax -> Masking.

Print -> Log -> Sleep.

DataLoader -> Augmentation -> Storage.

These ops cannot be fused in Triton.